flowchart TD
A[Raw data\nsource 1] --> B(Data Cleaning)
C[Raw data\nsource 2] --> B
D[Raw data\nsource 3] --> B
B --> E{Cleaned Database}
E --> F(Analysis 1)
E --> G(Analysis 2)
My name is Mick.
Plan exactly what you will be collecting
before starting your data collection
Data = any information collected, observed, generated or created in the process of your research
KEY RESOURCES:
Electronic Data
Hardcopy Data
Not appropriate
…coming soon - Quantitative Methods Workshop
Repeating the same mistake 10x is easier to fix than making 10 different mistakes…
| example 1 | example 2 | example 3 |
|---|---|---|
| k002 | 1859 | LTU295 |
| k003 | 1739 | LTU304 |
| k004 | 1069 | LTU205 |
| k005 | 1204 | LTU395 |
| k006 | 3801 | LTU591 |
| ... | ... | ... |
sex
height
date_surgery
Use consistent syntax:
l_leg_length
r_leg_length
dass_anxiety_1
dass_stress_2
Lleg_length
r_leg-length
dass_anxiety_q1
dass_scale_stress_2
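A naming convention can be checked automatically. The sketch below assumes one possible convention (lowercase letters and digits, joined by single underscores, starting with a letter); the pattern and the function name `is_consistent_name` are illustrative, not a standard.

```python
import re

# Assumed convention: lowercase words/digits joined by single underscores,
# starting with a letter. Adjust the pattern to your own project's rules.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def is_consistent_name(name: str) -> bool:
    """Return True if the whole name matches the assumed convention."""
    return SNAKE_CASE.fullmatch(name) is not None


names = ["l_leg_length", "r_leg_length", "Lleg_length", "r_leg-length"]
flagged = [n for n in names if not is_consistent_name(n)]
print(flagged)  # ['Lleg_length', 'r_leg-length']
```

A pattern check only catches formatting problems (capital letters, hyphens). It will not notice that `dass_anxiety_q1` and `dass_stress_2` follow different schemes, so a quick read of the variable list by eye is still worth doing.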
(… 2nd_doctor_name)
Use _ to separate info in variable naming
Especially important for categorical variables:
E.g.: a variable in which we are recording handedness (right / left)
right, left, right, right, left, right
1, 2, 1, 1, 2, 1
R, L, R, R, L, R
right, l, Right, r, 1, right
Numbers (1,2,3,4) instead of strings (right,left)

| id | timepoint | steps | RPE |
|---|---|---|---|
| ft01 | 1 | 1294 | 8 |
| ft01 | 2 | NA | NA |
| ft02 | 1 | 121 | 3 |
| ft02 | 2 | 51231 | NA |
| ft03 | 1 | NA | NA |
| ft04 | 1 | 1653 | 10 |
| ft04 | 2 | NA | 5 |
| ft05 | 1 | 12341 | 3 |
| ft06 | 1 | 12521 | NA |
| ft06 | 2 | NA | NA |

| id | timepoint | steps | RPE |
|---|---|---|---|
| ft01 | 1 | 1294 | 8 |
| ft01 | 2 | | |
| ft02 | 1 | 121 | 3 |
| ft02 | 2 | 51231 | |
| ft03 | 1 | | |
| ft04 | 1 | 1653 | 10 |
| ft04 | 2 | | 5 |
| ft05 | 1 | 12341 | 3 |
| ft06 | 1 | 12521 | |
| ft06 | 2 | | |
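Whichever way missing values are recorded, the code that reads the file needs to know about it. A minimal sketch, assuming the data are saved as CSV and that both `NA` and an empty cell mean "missing"; the helper `to_int_or_none` is a hypothetical name.

```python
import csv
import io

# Assumption: these are the only two ways a missing value is recorded.
MISSING = {"", "NA"}


def to_int_or_none(cell: str):
    """Convert a cell to int, or None if it holds a missing-value marker."""
    cell = cell.strip()
    return None if cell in MISSING else int(cell)


# A few rows from the tables above, written as CSV text
raw = """id,timepoint,steps,RPE
ft01,1,1294,8
ft01,2,NA,NA
ft02,2,51231,
"""

rows = []
for r in csv.DictReader(io.StringIO(raw)):
    rows.append({
        "id": r["id"],
        "timepoint": int(r["timepoint"]),
        "steps": to_int_or_none(r["steps"]),
        "RPE": to_int_or_none(r["RPE"]),
    })

print(rows[1])  # {'id': 'ft01', 'timepoint': 2, 'steps': None, 'RPE': None}
```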
careful with blank spaces: " male " is not the same as "male"
dates can be a pain (especially in Excel!), pick a format and stick to it
YYYY-MM-DD -> 2024-02-16
minimise ‘free text’ unless you need it for your research question
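Stray spaces and mixed date formats can both be tidied when the data are read in. A small sketch, assuming dates were originally typed as day/month/year (the `input_format` default is an assumption, as is the function name `to_iso_date`):

```python
from datetime import datetime


def to_iso_date(text: str, input_format: str = "%d/%m/%Y") -> str:
    """Parse a date typed in the assumed input format and return YYYY-MM-DD."""
    return datetime.strptime(text.strip(), input_format).date().isoformat()


print(to_iso_date(" 16/02/2024 "))  # 2024-02-16
print(" male ".strip() == "male")   # True
```

`strptime` raises a `ValueError` for anything that does not match the expected format, which is a handy way to find the entries that were typed differently.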
keep your file names consistent:
bloodmarkers_processed_20231202.csv
bloodmarkers_processed_20240104.csv
bloodmarkers_processed_20240207.csv
kneeoapaper_v1_20240216.docx
Multiple rectangles are ok! As long as they’re linked by identifying variables
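Linking two rectangles by an identifying variable might look like the sketch below. Both tables are hypothetical: one row per participant in the first, one row per visit in the second, with `id` as the shared key.

```python
# Hypothetical example data: values are made up for illustration only.
participants = [
    {"id": "ft01", "sex": 1},
    {"id": "ft02", "sex": 2},
]
visits = [
    {"id": "ft01", "timepoint": 1, "steps": 1294},
    {"id": "ft01", "timepoint": 2, "steps": None},
    {"id": "ft02", "timepoint": 1, "steps": 121},
]

# Index the participant table by id, then attach participant info to each visit
by_id = {p["id"]: p for p in participants}
linked = [{**by_id[v["id"]], **v} for v in visits]

print(linked[0])  # {'id': 'ft01', 'sex': 1, 'timepoint': 1, 'steps': 1294}
```

This only works if the `id` values are recorded identically in both tables, which is one more reason to keep ID formats consistent.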
| Country | Name | Cases |
|---|---|---|
| Afghanistan 1999 | John Smith | 2523035 |
| Afghanistan 2000 | Julia Proud | 23428 |
| Norway 2003 | Holga Svensson | 60123 |
| Norway 2005 | Erik Bryans | 1012959 |
| Germany 1999 | Klaus Schmidt | 912509 |
| Germany 2005 | Sofia Ellins | 12093 |
| Country | Year | First Name | Surname | Cases |
|---|---|---|---|---|
| Afghanistan | 1999 | John | Smith | 2523035 |
| Afghanistan | 2000 | Julia | Proud | 23428 |
| Norway | 2003 | Holga | Svensson | 60123 |
| Norway | 2005 | Erik | Bryans | 1012959 |
| Germany | 1999 | Klaus | Schmidt | 912509 |
| Germany | 2005 | Sofia | Ellins | 12093 |
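Going from the first table to the second can be scripted. A minimal sketch, assuming the year is always the last space-separated piece of the Country cell and the first name is always the first piece of the Name cell (neither will hold for every real dataset); `split_row` is a hypothetical helper.

```python
def split_row(row: dict) -> dict:
    """Split combined cells so each column holds one piece of information."""
    # Assumption: the year is the final space-separated token
    country, year = row["Country"].rsplit(" ", 1)
    # Assumption: the first token is the first name, the rest is the surname
    first_name, surname = row["Name"].split(" ", 1)
    return {
        "Country": country,
        "Year": int(year),
        "First Name": first_name,
        "Surname": surname,
        "Cases": row["Cases"],
    }


messy = [
    {"Country": "Afghanistan 1999", "Name": "John Smith", "Cases": 2523035},
    {"Country": "Norway 2003", "Name": "Holga Svensson", "Cases": 60123},
]
tidy = [split_row(r) for r in messy]
print(tidy[0])
```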
Clean up this spreadsheet:
Describe what the variables are, and how they are coded
For you and for the future researcher!
| variable_name | name/description | coding | notes |
|---|---|---|---|
| id | Patient ID | | |
| sex | sex at birth | 1 = female; 2 = male | self reported |
| dom_hand | dominant hand | 1 = right; 2 = left | |
| height | height in cm | numeric | |
| highest_ed | highest education achieved | 1 = Year 10; 2 = VCE; 3 = TAFE; 4 = University | |
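A data dictionary can also be used to check the data. The sketch below stores the codings from the table above in a Python dict and flags any value that is not a listed code; the example record and the function name `invalid_codes` are hypothetical.

```python
# Codings copied from the data dictionary; only coded (categorical) variables
# are listed, so id and height are not checked here.
DATA_DICTIONARY = {
    "sex": {1: "female", 2: "male"},
    "dom_hand": {1: "right", 2: "left"},
    "highest_ed": {1: "Year 10", 2: "VCE", 3: "TAFE", 4: "University"},
}


def invalid_codes(record: dict) -> dict:
    """Return the variables in a record whose value is not a listed code."""
    problems = {}
    for variable, codes in DATA_DICTIONARY.items():
        if variable in record and record[variable] not in codes:
            problems[variable] = record[variable]
    return problems


# Hypothetical record with one coding error (dom_hand = 3 is not defined)
record = {"id": "k002", "sex": 1, "dom_hand": 3, "height": 172, "highest_ed": 4}
print(invalid_codes(record))  # {'dom_hand': 3}
```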
Make a data dictionary for the file you’ve been working on
.xlsx files require Microsoft Excel to run - not everyone has this (though it is still probably ok to use)
.csv files are universally readable, efficient etc
flowchart TD
A[Raw data\nsource 1] --> B(Data Cleaning)
C[Raw data\nsource 2] --> B
D[Raw data\nsource 3] --> B
B --> E{Cleaned Database}
E --> F(Analysis 1)
E --> G(Analysis 2)
3 different copies of data 📄 📄 📄
2 different media 💻 💾
1 off site ☁️
Questions: m.girdwood@latrobe.edu.au